Replacing discontinued Big Tech mobility reports: a penetration-based analysis

People mobility data sets played a role during the COVID-19 pandemic in assessing the impact of lockdown measures and correlating mobility with pandemic trends. Two global data sets were Apple’s Mobility Trends Reports and Google’s Community Mobility Reports. The former is no longer available online, while the latter is no longer updated since October 2022. Thus, new products are required. To establish a lower bound on data set penetration guaranteeing high adherence between new products and the Big Tech products, an independent mobility data set based on 3.8 million smartphone trajectories is analysed to compare its information content with that of the Google data set. This lower bound is determined to be around 10−4 (1 trajectory every 10,000 people) suggesting that relatively small data sets are suitable for replacing Big Tech reports.

www.nature.com/scientificreports/ Metrics M 1 and M 2 are provided at daily temporal resolution and at different levels of temporal smoothing, uncertainty included. Mobility metrics time series are correlated with time series of Google's product in order to compare their mutual information content and to assess a lower bound on the number of smartphone trajectories (with respect to the country population size) which guarantees a high adherence between products.
Mobility data analysed in this work are elaborated daily by EQN and regularly made available on Zenodo through the MobMeter 42 data set. To the best of our knowledge, MobMeter is currently the only open source mobility data set covering multiple global countries and an extensive time period up to present. MobMeter is thus a unique instrument for researchers who may benefit from including mobility patterns as covariates in their statistical or stochastic models, with applications in multiple fields including epidemiology, climatology and economics.

Results
Global long-term mobility trends. Mobility metrics and their smoothed versions are computed for the aforementioned countries. Figure 1 shows polar plots of M 1 and M 2 time series based on a 14-day temporal smoothing. All countries show significant decrements in the M 1 metric and significant increments in the M 2 metric during the few months after March 11, 2020 (initial phase of the COVID-19 pandemic). Differences between countries are observed in the temporal rapidity of the subsequent "recovery". Some exhibit "lobed" polar plots. This is the case for GRC, ITA and TUR, which show fast recoveries during the summer of 2020 and a contraction of people mobility during the subsequent winter. All other countries exhibit "spiralling" polar plots, which is a sign of a slow recovery. This behaviour is clear for South American countries like ARG, COL, PER and VEN.
Short-local events in mobility metrics. M 1 and M 2 time series are also affected by short-term local events. Figure 2 shows how mobility metrics significantly changed during social unrest in ECU and PAN and when a hurricane made landfall in MEX. In ECU, social protests 43 occurred between June 13 and June 30. M 1 dropped from approximately 20 km to 11.5 km, while M 2 raised from approximately 32% to 40%. In PAN, social protests 44 broke out on July 16 and lasted nearly two weeks. M 1 dropped from around 30 km to 17 km, while M 2 raised from approximately 27% to 32%. Between May 30 and 31, 2022, Hurricane Agatha hit the Oaxaca state of MEX with flooding and mudslides that killed at least 10 people and left 20 missing 45 . M 1 dropped from an average of around 28 km (on the previous days) to around 22 km. This drop is seen at the country level, so it was mitigated by the relatively small area (with respect to the area of MEX) impacted by the hurricane. M 2 did not change significantly with respect to the previous day.

Comparison with Google's Community Mobility Reports.
A comparison is made between metrics M 1 and M 2 and the mobility indices of the Google product. Google's mobility indices are given as variations (in percentage) in the number of visits to categories of places with respect to a baseline. This means that Google's indices do not carry exactly the same information of M 1 and M 2 , nor the unit of measure is the same. Nonetheless, all indices are based on smartphone spatio-temporal trajectories, and their linear correlation is expected to be medium-high. M 1 is compared with Google's "Transit stations", "Parks" and "Retail and recreation" indices while M 2 with Google's "Residential" and "Workplaces" indices. It is expected that "Transit stations", "Parks" and "Retail and recreation" indices positively correlate with the daily average distance travelled by people, and that the "Residential" and the "Workplaces" indices correlate (positively and negatively, respectively) with the percentage of people who did not move during the 24 h of the day. Figure 3 shows, for each country and for different levels of smoothing of the time series, the linear correlation between the metrics and Google's indices. For most countries, correlations are high and significantly increase when moving from no smoothing to a 7-day moving average smoothing. The lowest correlations are exhibited by NIC, SVL and PHL, which are among the countries with the lowest average number of daily smartphone trajectories (average sample size) in the data set (see Fig. 4). Robustness of above correlations is tested in the Methods section.
Data set penetration analysis. The relationship between correlations and average sample size is better described by accounting for country population size. The average sample size is computed for each country for the period from March 11, 2020, to September 22, 2022, and divided by country population size. This gives the average data set penetration in each country.
Under the non-smoothing case, for each correlation between metrics and indices the 17 average penetration values are related with the 17 correlations using a beta regression (see Methods section). Figure 5 shows the data and modelling results. In general, the higher the average penetration, the higher the correlation between indices. Additionally, average penetrations between around 10 −4 (i.e., 1 smartphone trajectory every 10,000 people) delimit the transition between medium (~ 0.5) to high (> 0.7) correlations.

Discussion
Mobility metrics and indices are proxies of societal health and stress. Their constant monitoring informs assessment of the impact of both long-and short-term adverse events. Thus, location data collected by smartphone apps play an important role, as they carry information on people mobility. With Apple's Mobility Trends Reports discontinued and Google's Community Mobility Reports no longer updated, it is crucial to understand whether similar products can be produced from alternative mobility data sets and made available to the scientific community. www.nature.com/scientificreports/ This study reconstructs smartphone trajectories from an independent mobility data set provided by the Earthquake Network citizen science initiative. These allow estimation of country-level mobility metrics with relatively high correlations to similar indices by Google. The average data set penetration among the country www.nature.com/scientificreports/ population requires 1 smartphone trajectory for every 10,000 people. This prove that relatively small data sets are suitable for replacing country-level Big Tech products when their analysis is based on a sound methodology which also provides measures of uncertainty.
Since it happened with Google's and Apple's products, it is meaningful to discuss the possible end-of-life of the MobMeter data set. While there is no guarantee that the Earthquake Network initiative will last years or decades, it is likely that citizens' interest in earthquakes and in the EQN smartphone app will not vanish in the foreseeable future. On the other hand, personal mobility data are inevitably connected to the privacy and the security of the individual 46 . The collection of personal data by smartphone apps is usually regulated by the app-store owner, with Google and Apple, again, as the main actors. This implies that new rules and limits on how personal data are collected may come into place at any time, making the publishing of MobMeter hard or impossible (long before Earth's seismicity will fade away).

Methods
Trajectory description. The mobility data set analysed in this work includes smartphones trajectories collected from the 17 countries during the period from March 11, 2020, to September 22, 2022 (926 days). Each trajectory covers any possible subset of the 926 days (from only one day to the full period).
In general, the k th trajectory, 1 ≤ k ≤ K , is composed of M k observations, with K the total number of trajectories. The m th observation, 1 ≤ m ≤ M k , of the k th trajectory is given by.
where ID k is the anonymized smartphone/trajectory identifier, lat m and lon m are the latitude and longitude smartphone coordinates, u m is the uncertainty on smartphone coordinates, and t m is the timestamp.
Uncertainty u m is in metres and represents the standard deviation (sigma) of two independent normal distributions centred on each smartphone coordinate. Timestamp t m is based on the country local time and refers to the time lat m and lon m are observed by the smartphone. Timestamps have the following constraint: min(t m+1 − t m ) ≥ 20 min.
(1) (ID k , lat m , lon m , u m , t m ), For any two consecutive trajectory observations, the geodetic distance l m,m+1 between (lat m , lon m ) and (lat m+1 , lon m+1 ) is computed. The following transformation is then applied: The transformation implies that the estimated travelled distance l m,m+1 between time t m and t m+1 is greater than zero only if the 1-sigma uncertainty disks do not overlap. The daily travelled distance by the k th smartphone is given by: where I t m ∈ τ d , τ d+1 is 1 if timestamp t m is within day d and 0 otherwise (with τ d the timestamp related with midnight of day d ). Since l M k ,M k +1 is computed using the last trajectory observation of day d and the first of day d + 1 , Eq. (3) implies that any travel occurring across midnight contributes to the total travelled distance on day d.
(3) L d,i,k = M k m=1 l m,m+1 I t m ∈ τ d , τ d+1 , Figure 3. Linear correlation between the M 1 mobility metric and Google's "Transit stations" index, between M 1 and Google's "Parks" index, between M 1 and Google's "Retail and Recreation" index, between the M 2 mobility metric and Google's "Residential" index and between the M 2 and Google's "Workplaces" index. 1-day moving average means no smoothing. All correlations between M 2 and Google's "Workplaces" index are negative but they are reported without sign to assure readability of the plots. In brackets, the average data set penetration in each country. For each country, first-level administrative divisions (regions) are considered. Each region R i,c is described in terms of multiple polygons defined in the geographical space. These polygons are freely available for download at diva-gis.org/gdata. The number of first-level administrative divisions by country is given in Table 1.
The daily average travelled distance for the c th region of the i th country is given by: where I lat k , lon k ∈ R i,c is equal to 1 if the daily average smartphone coordinates lat k , lon k are in the R i,c region, and 0 otherwise. N d,i,c is the number of daily trajectories in the c th region. Let p i,c be the population count of the c th region, 1 ≤ c ≤ C i . The daily average distance for the i th country (mobility metric M 1 ) is given by  www.nature.com/scientificreports/ is a weight based on the region population. The adoption of this weighting approach is dictated by three reasons: (1) the spatial distribution of smartphone-app users does not necessarily mimic the population distribution; (2) events affecting people mobility may be limited to some regions, or their strength vary across regions 47 ; (3) in general, a weighting approach based on a population stratification helps reduce the bias of estimates 48 . By replacing L d,i,k with U d,i,k in Eq. (5) and following the same procedure described above, the mobility metric M 2 (i.e., U d,i ) is computed for each day and country.
Uncertainty assessment. Uncertainty on daily M 1 and M 2 figures (i.e., L d,i and U d,i , respectively) is assessed using a non-parametric bootstrap approach 49

Comparison with Google's Community Mobility Reports. Community Mobility Reports by Google
gives percentages of variation in the number of visits to place categories with respect to a baseline. The categories are "Retail and recreation", "Grocery and pharmacy", "Parks", "Transit stations", "Workplaces" and "Residential". For each category, time series of percentages of variation are available at both country and regional levels with daily temporal resolution. M 1 and M 2 time series are compared with Google's country-level time series by computing linear correlations. M 1 is correlated with "Transit stations", "Parks" and "Retail and recreation" indices while M 2 with "Workplaces" and "Residential" indices. Comparison is made using both non-smoothed and smoothed time series (i.e., q ∈ {7, 14, 21, 28} ). A highly positive or highly negative correlation means that M 1 and/or M 2 carry information on people mobility similar to that of Google's community mobility reports.

Robustness analysis.
To test robustness of the comparison described in the previous section, a timeshifted correlation analysis is also implemented. Whenever two time series are considered, one time series is shifted by a lag of � = −14, . . . , 0, . . . , 14 days and linear correlation is computed. Figure 6 shows results for the 17 countries.
For countries with a relatively high average data set penetration (see for instance ARG, CHL and PER), correlations are maximum when = 0 and, due to the weekly cycle, they also tend to be high when = −14, −7, 7, 14 . When penetration is lower this behaviour is disrupted and the maxima are not necessarily at = 0 (or not all of them are at = 0 ). TUR is an exception since peaks of the time-shifted correlation graph are located where expected despite the low penetration.

Beta regression on correlations vs average penetration.
A beta regression is adopted to describe the relationship between the average data set penetration and the correlations without sign |ρ| between the nonsmoothed M 1 and M 2 metrics and Google's indices. Beta regression is imposed by |ρ| ∈ [0, 1].
For the generic i th country, |ρ i | ∼ B(µ i , φ) , where φ is the precision parameter of the beta distribution and g(µ i ) = x i ′ β , with g the logit link function, x i the vector of regressors and β the vector of unknown model parameters. Here, x i = 1, log 10 (π i ) ′ , with π i the average data set penetration for the i th country. Model fitting capability is described by the pseudo coefficient of determination R 2 = corr |ρ|, |ρ| 2 , with |ρ| the model estimate. An F-test on the regression model is used to tests whether the model fits significantly better than a model with only the constant term (i.e., x i = 1).

Sensitivity analysis.
Estimates of M 1 and M 2 are based on three arbitrary choices. First, only trajectories with at least n observations within a span of at least n hours are used ( n = 12 ). Second, l m,m+1 = l m,m+1 only if l m,m+1 ≥ ru m + ru m+1 ( r = 1 , see Eq. (2)). Third, U d,i,k = 1 if L d,i,k < z ( z = 0.2km , see Eq. (4)).
The choice of n affects both M 1 and M 2 , while the choice of r and z only affects M 2 . Values used in this work are the result of a sensitivity analysis. Considering ITA and the period from March 11, 2020, to September 22, 2022, the correlation without sign |ρ 1 | between M 1 (non-smoothed) and the "Transit stations" index by Google and the correlation without sign |ρ 2 | between M 2 (non-smoothed) and the "Residential" index by Google are estimated for each combination of n ∈ {3, 6, 9, 12, 15} , r ∈ {1, 2, 3} and z ∈ {0.1, 0.2, 0.3, 0.4}km.